Added nerc_invoicing repo with functions for querying invoicing Iceberg table by jimmysway · Pull Request #267 · CCI-MOC/invoicing

jimmysway · 2026-02-17T15:57:27Z

Added nerc_invoicing repo for querying invoicing Iceberg table

Added functionality to get lifetime costs grouped by project

Added pyproject.toml for future publishing

Closes #266

knikolla · 2026-02-17T16:14:22Z

+BALANCE_COL = "Balance"
+
+
+def get_invoice_dataframe() -> pd.DataFrame:


You should cache this using functools.cache

knikolla

Previous comments still unresolved.

knikolla

Comments about using the config system and caching still applicable.

jimmysway · 2026-02-18T19:25:35Z

Comments about using the config system and caching still applicable.

Sorry I missed those I am addressing those now. I think caching the get_invoice_dataframe() function might be premature. The function returns a dataframe potentially for a specific filter if we are only querying this dataframe once there is little benefit from caching right now.

I will instead apply it to the _get_table() function

knikolla · 2026-02-18T19:46:42Z

Comments about using the config system and caching still applicable.

Sorry I missed those I am addressing those now. I think caching the get_invoice_dataframe() function might be premature. The function returns a dataframe potentially for a specific filter if we are only querying this dataframe once there is little benefit from caching right now.

I will instead apply it to the _get_table() function

Why do you think _get_table() is a better place for implementing caching?

jimmysway · 2026-02-18T20:20:16Z

Comments about using the config system and caching still applicable.

Sorry I missed those I am addressing those now. I think caching the get_invoice_dataframe() function might be premature. The function returns a dataframe potentially for a specific filter if we are only querying this dataframe once there is little benefit from caching right now.
I will instead apply it to the _get_table() function

Why do you think _get_table() is a better place for implementing caching?

I applied caching to _get_table() instead because it returns a reusable StaticTable handle (metadata + query interface), not full row data. Caching _get_table() mainly reduces repeated table-initialization overhead on every call.

If we were instead to cache the dataframe itself with get_invoice_dataframe(). We would instead be caching filter-specific dataframes which we don't reuse much right now.

I guess the question is what is the purpose of this library? Are we going to use as a thin wrapper for downstream processes to run predictable queries once a month, or are we going to be using this as a primary way to interface with Iceberg where it will be used as other processes or data tools to interface and run analytics on the data itself.

Currently, I think the library is going to be used as a thin wrapper. Since this module is just a thin wrapper, it should do the simple safe optimization (_get_table) and leave heavier caching to whatever service or job is using it.

jimmysway · 2026-02-18T23:01:04Z

+    return grouped_df
+
+
+def aggregate_by(


I've split the previous select_and_group into two functions aggregate_by and group_and_sum. I think the old function was doing too much and was misleading. This way group_and_sum alone is useful whenever you already have a DataFrame and just need deterministic aggregation.

QuanMPhm

Some comments I have for now. Will take a closer look and run some tests

QuanMPhm

Thank you for the good work. Just a few more comments mostly regarding style.

QuanMPhm · 2026-02-20T20:06:08Z

+    group_by: tuple[str, ...],
+    *,
+    agg_col: str,
+    agg_name: str = "total",


@knikolla Is there a strong need for the output column to have a custom name? This function already only outputs the grouped and output columns, so without agg_name, there would not be any ambiguity.

a_summed = a.groupby([INVOICE_DATE_FIELD, PROJECT_FIELD], dropna=False, as_index=False)["Balance"].agg("sum") print(a_summed.head()) Invoice Month Project - Allocation Balance 0 2024-03 False 6.43 1 2024-03 True 403.30 2 2024-03 NaN 3.30 3 2024-04 False 1000.00 4 2024-04 True 510.00

I just don't want unneeded features for now.

QuanMPhm · 2026-02-26T16:40:57Z

+    expression = costs._row_filter(**filters)
+    assert isinstance(expression, costs.And)
+    assert isinstance(expression.left, costs.EqualTo)
+    assert isinstance(expression.right, costs.EqualTo)


I would suggest a stricter check, like an equality:

Suggested change

expression = costs._row_filter(**filters)

assert isinstance(expression, costs.And)

assert isinstance(expression.left, costs.EqualTo)

assert isinstance(expression.right, costs.EqualTo)

expression = costs._row_filter(**filters)

assert isinstance(expression, costs.And)

assert expression.left == costs.EqualTo(PID, "vllm-test")

...

The test checks that the right type of expression was built, which I think is sufficient. Checking exact values is not really useful imho because if _row_filter garbles the values, group_and_sum and aggregate_by tests would catch it anyway since they operate end-to-end with real data.

The test checks that the right type of expression was built, which I think is sufficient. Checking exact values is not really useful

I disagree. This unit test is the only one letting us understand the expected behavior of _row_filter. We should know exactly the expected output should be, and in this is case it is not difficult to do so.

because if _row_filter garbles the values, group_and_sum and aggregate_by tests would catch it anyway since they operate end-to-end with real data.

get_invoice_dataframe, the only function that calls _row_filter, is heavily mocked in your test cases. None of your test cases deliberately test the filtering feature because we lack an integration test. Therefore, I'll argue none of the tests you presented adequately test _row_filter

QuanMPhm · 2026-02-26T17:41:17Z

+def test_aggregate_by_rounds_and_forwards_filters(
+    monkeypatch: pytest.MonkeyPatch, sample_invoice_dataframe: pd.DataFrame
+):
+    captured: dict[str, object] = {}
+
+    def _fake_loader(cols=None, **filters):
+        captured["cols"] = cols
+        captured["filters"] = filters
+        return sample_invoice_dataframe
+
+    monkeypatch.setattr(costs, "get_invoice_dataframe", _fake_loader)


Have you considered unittest.mock, or pytest-mock to mock get_invoice_dataframe? Mocking will provide you with helper functions to capture how get_invoice_dataframe was called, removing the need for _fake_loader. There's a few examples of this mocking in this repo

This suggest would also help remove some mocking code you have further in your test cases, such as for class _FakeIcebergTable

I think we should keep the _FakeIcebergTable. I believe having code that is understandable is better.

_FakeIcebergTable explicitly documents the .scan().select().to_pandas() chain the code depends on. Which is very useful to understand the context which is something that a MagicMock will abstract away.

The purpose of the test case is to see if get_invoice_dataframe emits a warning if the filtered dataframe is empty. It is explicitly not concerned with the internals of the iceberg table, which is why this is mocked out in the first place. A developer would review a test case to understand the expected behavior, not the internals of how that is done.

Looking at it a bit more, I would argue asserting that the output equals the input is moot here, since you setup the mock to return the same dataframe. If we wanted to test the pyiceberg code, we would need an integration test, which can be a part of this PR or a seperate one. @knikolla Thoughts?

QuanMPhm · 2026-03-10T19:17:39Z

@jimmysway Some CI test cases are broken

Added functionality to get lifetime costs grouped by project Added pyproject.toml for future publishing Added testing for new functionality

…o alert based on the "Cost" column I have made the changes to the code as necessary, switching from summing "Balance" to "Cost" instead Added equality check for row_filter test

QuanMPhm · 2026-04-09T18:02:17Z

@knikolla @naved001 I made some small amendments to this PR. I have one question here

jimmysway requested review from QuanMPhm and knikolla February 17, 2026 15:58

knikolla requested changes Feb 17, 2026

View reviewed changes

jimmysway force-pushed the feature/266-add-invoicing-library branch 4 times, most recently from 8b3abd3 to 73b5ca0 Compare February 17, 2026 20:18

knikolla requested changes Feb 17, 2026

View reviewed changes

Comment thread pyproject.toml Outdated

Comment thread pyproject.toml Outdated

Comment thread process_report/data_tools/costs.py Outdated

jimmysway requested a review from knikolla February 17, 2026 22:03

knikolla requested changes Feb 17, 2026

View reviewed changes

jimmysway force-pushed the feature/266-add-invoicing-library branch from 73b5ca0 to 42f08b4 Compare February 17, 2026 22:25

knikolla requested changes Feb 17, 2026

View reviewed changes

jimmysway force-pushed the feature/266-add-invoicing-library branch from 42f08b4 to a2fbd6f Compare February 18, 2026 19:27

knikolla requested changes Feb 18, 2026

View reviewed changes

Comment thread process_report/data_tools/costs.py Outdated

Comment thread .github/ISSUE_TEMPLATE/custom.md

Comment thread process_report/data_tools/config.py Outdated

Comment thread process_report/data_tools/__init__.py Outdated

Comment thread process_report/data_tools/costs.py Outdated

jimmysway force-pushed the feature/266-add-invoicing-library branch from a2fbd6f to 6239016 Compare February 18, 2026 22:42

jimmysway closed this Feb 18, 2026

jimmysway force-pushed the feature/266-add-invoicing-library branch from 6239016 to 4920ec7 Compare February 18, 2026 22:52

jimmysway reopened this Feb 18, 2026

jimmysway commented Feb 18, 2026

View reviewed changes

QuanMPhm requested changes Feb 20, 2026

View reviewed changes

Comment thread pyproject.toml Outdated

Comment thread pyproject.toml Outdated

Comment thread process_report/data_tools/config.py Outdated

Comment thread process_report/tests/unit/data_tools/test_data_tools.py Outdated

jimmysway force-pushed the feature/266-add-invoicing-library branch from 31f1732 to 7ef0bf9 Compare February 24, 2026 15:24

jimmysway requested review from QuanMPhm and knikolla and removed request for QuanMPhm February 24, 2026 15:24

jimmysway force-pushed the feature/266-add-invoicing-library branch from dee51ba to 89d21b0 Compare February 24, 2026 15:36

QuanMPhm requested changes Feb 26, 2026

View reviewed changes

jimmysway force-pushed the feature/266-add-invoicing-library branch 2 times, most recently from 89d21b0 to 97a1923 Compare March 2, 2026 16:14

jimmysway requested a review from QuanMPhm March 4, 2026 14:19

jimmysway force-pushed the feature/266-add-invoicing-library branch from 9a1a7bc to c96e05a Compare March 19, 2026 14:03

QuanMPhm mentioned this pull request Apr 8, 2026

Track initialized and used columns in each processor #285

Merged

jimmysway added 2 commits April 9, 2026 11:40

Added nerc_invoicing repo for querying invoicing Iceberg table

618e23f

Added functionality to get lifetime costs grouped by project Added pyproject.toml for future publishing Added testing for new functionality

As part of our ongoing lifetime alerting discussion we have decided t…

ff60637

…o alert based on the "Cost" column I have made the changes to the code as necessary, switching from summing "Balance" to "Cost" instead Added equality check for row_filter test

QuanMPhm force-pushed the feature/266-add-invoicing-library branch from c96e05a to ff60637 Compare April 9, 2026 18:01

QuanMPhm requested a review from naved001 April 9, 2026 18:55

		BALANCE_COL = "Balance"


		def get_invoice_dataframe() -> pd.DataFrame:

Uh oh!

Conversation

jimmysway commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

knikolla left a comment

Choose a reason for hiding this comment

Uh oh!

knikolla left a comment

Choose a reason for hiding this comment

Uh oh!

jimmysway commented Feb 18, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

knikolla commented Feb 18, 2026

Uh oh!

jimmysway commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

QuanMPhm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

QuanMPhm left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

QuanMPhm Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

QuanMPhm Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

QuanMPhm commented Mar 10, 2026

Uh oh!

QuanMPhm commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

jimmysway commented Feb 17, 2026 •

edited

Loading

jimmysway commented Feb 18, 2026 •

edited

Loading

QuanMPhm left a comment •

edited

Loading

QuanMPhm Feb 26, 2026 •

edited

Loading

QuanMPhm Mar 10, 2026 •

edited

Loading